JOHN McQUILLEN INTERVIEW 
Okay. Well, you're taking me into a pretty technical area. You know that. 
Yeah. And I as a non-technical person must tape it. Yeah. 
All right. Well, gosh, I haven't worked on this for more than 20 years. I'll see if I 
can remember it. 
I know. It's been very tough, John, getting everyone's memories. So you're not alone. 
Okay. So you want me to explain the s_urious _O_ck. .The way the IMPs 
communicated was to send a packet from one IMP to the net and to receive an 
acknowledgement back and the storm forward protocol. IMPs/would have several 
packets in queue and would send those packets off, and as acknowledgements came 
back, would release the packets from the queue. If an acknowledgement wasn't 
received in a certain amount of time, then the packet would be retransmitted. So it's 
a positive acknowledgement retransmission system. Things were working fine until 
we changed the software one time. 
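An illustrative sketch, not from the interview: the positive acknowledgement retransmission scheme described above can be modeled in a few lines of Python. The class name `ParSender`, the tick-based clock, and the timeout value are all assumptions made for illustration; nothing here reflects the actual IMP code.

```python
class ParSender:
    """Toy positive-acknowledgement-retransmission sender (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout      # ticks to wait before retransmitting
        self.queue = {}             # seq -> (payload, tick of last send)
        self.wire = []              # log of (tick, seq) transmissions

    def send(self, seq, payload, now):
        # Keep the packet queued until its acknowledgement arrives.
        self.queue[seq] = (payload, now)
        self.wire.append((now, seq))

    def on_ack(self, seq):
        # Release the packet; return False for an ack with no matching packet.
        return self.queue.pop(seq, None) is not None

    def tick(self, now):
        # Retransmit every queued packet whose timer has expired.
        for seq, (payload, sent_at) in list(self.queue.items()):
            if now - sent_at >= self.timeout:
                self.queue[seq] = (payload, now)
                self.wire.append((now, seq))


sender = ParSender(timeout=3)
sender.send(1, b"hello", now=0)
sender.send(2, b"world", now=0)
sender.on_ack(1)          # packet 1 is released from the queue
sender.tick(now=3)        # packet 2 timed out, so it is sent again
print(sender.wire)        # [(0, 1), (0, 2), (3, 2)]
```

In this model, an acknowledgement for a sequence number that is not in the queue makes `on_ack` return False, which is the kind of "ack with no packet" the spurious ack story is about.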
You and Crowther? 
I changed the software one time. It was me. 
And you were working at BB&N at the time? 
Yeah. I worked at BB&N for ten years. I had the responsibility for writing all the 
IMP software and making all the releases into the network from something like 1972 
to 1974. This problem came up in probably '71 or '72. The problem was that the 
IMP would get an ack back and there wasn't a packet for it. It really shouldn't have 
happened. The system was very simple. You send the packets out and they all came 
back. I had developed a kind of a clever way of having each packet that was coming 
back in the reverse direction carry acknowledgements in it, and in fact, it could carry 
several acknowledgements in it very inexpensively in (what we first had.) So we 
thought that was very clever and we thought that the system was kind of simple and 
foolproof. We had the spurious ack problem that was very, very unlikely. It took 
testing the system extensively and intensively to make it fail at all. I had to build 
fairly elaborate networks in the laboratory to make it fail and I had to run it for 
hours. 
You had to simulate failure. 
I had to simulate a real network with a lot of real traffic to get it to fail. Then it 
would fail once and I'd be left with a sort of bunch of computer corpses and I 
couldn't figure out what the problem was. It doesn't reflect that well on me, because 
it took me days and days and ultimately weeks of really intensive effort of being 
there night and day, having only a few failures to work with. Until I finally -- I kept 
tracking it down. And finally I discovered what the problem was. The problem was 
an interaction between the hardware and the software. You have to imagine that as 
packets are being sent out from the IMP, they're going out from memory, of course, 
and as packets are coming in, they're coming into memory, and as packets are being 
released from the queue when they're acknowledged, they're freed, and are put into 
a free place in memory. 
The problem was that there was an interaction between the packets flowing in and 
out on the telephone lines and the time that the computer was using the buffers in 
memory. I'm not sure if I can explain enough of this to you. But the problem was 
that there was a race condition between when the computer freed the buffer and 
when the telephone line would start to fill it with data or start to empty it with data. 
I believe that the problem was this. That the packets were put on the queue to be 
sent and if they weren't acknowledged, they were sent again after a certain amount 
of time. I thought I'd been clever in deciding that if the IMP had nothing else to do, 
it could keep sending some of the old packets. The problem came about that if the 
IMP was sending one of these old packets and an acknowledgement came in for it, 
the IMP would then free that packet -- its memory to be reused. If while the packet 
was still being transmitted by the telephone line, the memory that the packet was 
sitting in was then reused by another packet, it would get partially overwritten. So 
you have to imagine that the telephone line has already gotten this piece of memory 
that it's starting to send out from that piece of memory, and now if the computer 
gives this memory to somebody else and they start writing into it, what will actually 
flow out into the telephone line is half of one packet and half of another packet. 
Garbage. This only happened in the kind of race condition that the packet was 
starting to be transmitted and it was acknowledged and if the memory was reused 
all within a few milliseconds of each other, so that the packet would get corrupted. 
It manifested itself in a couple of different ways. But the main way was that it would 
appear to be carrying acknowledgements that were meaningless, because some other 
packet had overwritten things on the IMP. 
So the cure for that was -- there were multiple cures for it. One was just interlock 
things so that we didn't use the same memory for multiple things at the same time. 
We didn't understand that we were, but we changed that. But more generally, we 
then put in a lot of integrity checks in the IMP to make sure that the packets were 
correct. 
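An illustrative sketch, not from the interview: the race described above, and the interlock that cures it, can be simulated with a one-buffer pool. The names `Buffer`, `Pool`, and `run`, the `in_flight` flag, and the two-halves model of line transmission are assumptions chosen for illustration, not a description of the real IMP hardware or software.

```python
class Buffer:
    def __init__(self, size):
        self.data = bytearray(size)
        self.in_flight = False   # True while the line hardware is reading it


class Pool:
    def __init__(self, count, size, interlock):
        self.free = [Buffer(size) for _ in range(count)]
        self.pending = []        # acked buffers the line is still reading
        self.interlock = interlock

    def allocate(self, payload):
        buf = self.free.pop()
        buf.data[:] = payload
        return buf

    def release(self, buf):
        # With the interlock, an in-flight buffer is parked instead of freed.
        if self.interlock and buf.in_flight:
            self.pending.append(buf)
        else:
            self.free.append(buf)

    def transmit_done(self, buf):
        buf.in_flight = False
        if buf in self.pending:
            self.pending.remove(buf)
            self.free.append(buf)


def run(interlock):
    pool = Pool(count=1, size=8, interlock=interlock)
    old = pool.allocate(b"AAAAAAAA")
    old.in_flight = True
    sent = bytes(old.data[:4])          # line sends the first half
    pool.release(old)                   # ack arrives mid-transmission
    if pool.free:                       # a new packet grabs any free buffer
        pool.allocate(b"BBBBBBBB")
    sent += bytes(old.data[4:])         # line sends the second half
    pool.transmit_done(old)
    return sent


print(run(interlock=False))  # b'AAAABBBB'
print(run(interlock=True))   # b'AAAAAAAA'
```

Without the interlock, what goes out on the line is half of one packet and half of another. With it, the buffer is not reused until the transmission has finished.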
An interesting effect thing here is -- I don't know about interesting, but sort of 
technically interesting effect is that if you have this patch of memory that you've 
said, "Okay, send out this packet." You've learned about check (sums) and how these 
things are properly check summed. Okay. So there's a check sum of the bits that 
are flowing out of memory onto the wire. Now the bits that were flowing out of this 
memory were sent over the line correctly. So the check sum was okay. It's that we 
were sending the wrong bits. So we ended up putting in a second level of check sum, 
which was a check sum in software on the intended contents from memory. So we 
have one check sum to make sure there isn't a telephony error in transmission. 
Then we had a second check sum to make sure that there wasn't an integrity error 
in hardware or software in the IMP. That turned out to be very useful later for 
catching all kinds of hardware malfunctions and other things. 
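An illustrative sketch, not from the interview: the difference between the two levels of checksum can be shown with a toy model. The additive `checksum` function and the dictionary layout are assumptions; the interview does not say which checksum algorithms the IMP used.

```python
def checksum(data):
    """Toy 16-bit additive checksum (hypothetical, not the IMP's algorithm)."""
    return sum(data) & 0xFFFF


def build_packet(payload):
    # Software checksum: computed once over the intended contents.
    return {"payload": bytearray(payload), "sw_sum": checksum(payload)}


def transmit(packet):
    # Line checksum: computed over whatever bits actually leave memory.
    bits = bytes(packet["payload"])
    return {"bits": bits, "line_sum": checksum(bits), "sw_sum": packet["sw_sum"]}


def receive(frame):
    line_ok = checksum(frame["bits"]) == frame["line_sum"]
    software_ok = checksum(frame["bits"]) == frame["sw_sum"]
    return line_ok, software_ok


packet = build_packet(b"routing update")
packet["payload"][0:4] = b"\x00\x00\x00\x00"   # memory is clobbered before sending
print(receive(transmit(packet)))               # (True, False)
```

The line checksum passes because the wrong bits were transmitted faithfully. Only the checksum taken over the intended contents notices that memory changed before the packet went out.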
Who were you working with? 
I was working with all of these guys. Will Crowther and Dave Walden and others. 
At this point, Will and Dave had stopped programming the IMP and I was 
programming it. 
Were you actually at BB&N when the IMP was built? 
I joined in the middle of '71. 
Oh, okay. So it was after it had been installed. 
It was after the first hardware and software was out there, and then, you know, it 
got entirely rewritten multiple times. Right. So I wasn't part of the team that 
developed it in '68 and '69. But I was the one who developed the level two and level 
three protocols and the routing algorithms which were actually used for most of the 
1970's. 
Is this stuff that you had studied in college? 
Yeah. I have my undergraduate and Masters and PhD degrees from Harvard in 
Computer Science. I got my undergraduate degree at Harvard in 1970 and as a grad 
student, I connected Harvard's PDP-1 to the Arpanet. 
Really? 
Ben Barker was at Harvard also. You've interviewed him? 
Uh-huh. 
And I was introduced to the Arpanet and IMPs and BBN by Ben. As a grad student, 
I built the hardware and software to connect this computer to the Arpanet and then 
I decided to go to BBN after I got my master's degree. Then over the next three 
years, from '71 to '74, I developed a whole suite of protocols for the IMP. 
Basically, I rewrote the IMP. 
Oh, really? 
And then I wrotethe ne rouw ting algoritandI did my PhD work at Harvard on 
that part time. That is, I ts._wqdnfull time at BBN and I did my thesis part 
time. 
What is the story of everything getting routed to Harvard at some point? 
Yeah. 
That's another story. 
That's another story and it's kind of related to this. Now that we've talked about 
packets and acks, imagine it as a special kind of a packet which just says, "From this 
IMP, how far away are all the other IMPs," and this says, "Okay, for number one and 
two and three and four," you know, "I'm six away from this and four away from that." 
And it chooses the shortest route. 
Once all the IMPs have all this information, then they can all choose the best route. 
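An illustrative sketch, not from the interview: the exchange of "how far away are all the other IMPs" packets is what is now usually called distance-vector routing. The function `update_routes`, the unit link cost, and the example topology are assumptions made for illustration; the real IMP routing metric and table layout are not given in the interview.

```python
def update_routes(my_table, neighbor, advertised, link_cost=1):
    """Merge one neighbor's advertised distances into my_table.

    my_table maps destination -> (distance, next_hop). Hypothetical layout.
    """
    for dest, dist in advertised.items():
        candidate = dist + link_cost
        best = my_table.get(dest)
        # Take the route if it is new, shorter, or an update from the
        # neighbor we already route through.
        if best is None or candidate < best[0] or best[1] == neighbor:
            my_table[dest] = (candidate, neighbor)
    return my_table


table = {"UCLA": (5, "MIT")}
update_routes(table, "BBN", {"UCLA": 2, "UTAH": 3})
print(table)   # {'UCLA': (3, 'BBN'), 'UTAH': (4, 'BBN')}
```

Each node repeats this merge every time a neighbor's routing packet arrives, which is why the data has to keep updating itself.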
Does this data constantly update itself? 
Yeah. 
It has to. 
Right. And this is the stuff that I changed. But before I explain what I changed, I'll 
just explain briefly what happened here. It's closely related to this problem. We 
thought that these routing tables were safe from errors and corruption because they 
have this hardware check sum -- any errors introduced by telephony problems. But 
what happens if a computer's memory fails, so that every now and again once it just 
reads out all zeros? Okay? I mean, you know, computers fail in all different ways. 
The IMP at Harvard had a faulty memory bus. So that every so often, when you 
went to read a memory location, you'd get a zero. In fact, you'd get all zeros. So 
it would construct a routing packet here that was a perfectly legitimate, syntactically 
correct routing packet with all zeros. Then it would go out over the network nicely 
check summed. All the other IMPs would get it and they would say, "Boy, does 
Harvard have a really great route to all these other places." And they'd just send all 
the traffic there and it became a black hole, because all the traffic went there, and 
like a black hole, we couldn't even get information out. See, once all the traffic is 
going there, even the kind of network management and control traffic, which we 
were using to remotely diagnose and debug, gets sucked into the gravitational orbit. 
So it's just like a black hole where you can't see it or observe it. You just know that 
something bad has happened. 
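An illustrative sketch, not from the interview: the black hole effect follows directly from choosing the smallest advertised distance. The function `choose_next_hop`, the unit link cost, and the distances below are assumptions made for illustration only.

```python
def choose_next_hop(adverts, link_cost=1):
    """Pick, per destination, the neighbor advertising the smallest distance."""
    routes = {}
    for neighbor, distances in adverts.items():
        for dest, dist in distances.items():
            total = dist + link_cost
            if dest not in routes or total < routes[dest][0]:
                routes[dest] = (total, neighbor)
    return {dest: hop for dest, (_, hop) in routes.items()}


# Hypothetical adverts from two neighbors while everything is healthy.
healthy = {"MIT": {"UCLA": 4, "UTAH": 3}, "HARVARD": {"UCLA": 5, "UTAH": 6}}
print(choose_next_hop(healthy))   # {'UCLA': 'MIT', 'UTAH': 'MIT'}

# Faulty memory reads back zeros, yet the packet is well formed and checksummed.
faulty = {"MIT": {"UCLA": 4, "UTAH": 3}, "HARVARD": {"UCLA": 0, "UTAH": 0}}
print(choose_next_hop(faulty))    # {'UCLA': 'HARVARD', 'UTAH': 'HARVARD'}
```

Nothing in the selection rule can tell a genuinely short route from an all-zeros packet, so every destination is redirected toward the faulty node.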
Eventually, we had to kind of (cauterize) the network, cutting off that part of the 
network and then rebuilding it. Then we learned our lesson. As I say, these two 
stories are somewhat related. That one by one, we had to fix all the problems that 
should never even happen. You know, so the IMP was a very good lesson in a couple 
of things. One is truly distributed computations, so that in this first example of the 
spurious ack, we have two computers that are working on the same problem at the 
same time. Did this packet get through properly? If they're doing that, you have 
to worry about all the potential sequencing and simultaneity of the different things 
that they might do. 
Then you have the routing problem, which is you might have 50 computers all 
working on the same problem at the same time, which is, you know, what's the best 
way to get to Utah? And if any one of them has a failure -- a hardware failure -- 
you just have to make sure that the calculation proceeds. You just know that the 
computers are going to have lightning storms, and power failures, and software 
bugs, and hardware bugs, and the janitor's going to trip the power cord, and just 
anything you can think of could happen. So we had to completely bullet-proof this 
distributed calculation so that it would keep working in the face of, as I say, quote 
"impossible problems." And we actually did that over a period of a couple of years 
and there were a few other funny, hah-hah, episodes where the whole network 
crashed because of problems like this. Then we finally got it completely licked. 
So this caused the network to crash when this happened, because everything -- or did it? 
Basically, yes. 
What year was this? 
Oh, this was very early. '73 or something. 
So there were very few nodes. 
Yes. And also, in this whole period of many years that I was involved in it, there 
were only two or three -- there were three examples of problems like this, and they 
took the network down for a matter of minutes or an hour until we just could 
basically cut it off and put it back together again. Which is a very good record, 
because it was already working perfectly fine except for really pathological 
conditions. Usually you'd expect if a computer is reading out all zeros that it'll just 
stop, because it can't read its own instructions or anything. So we're into this very, 
very intermediate case where it has a momentary amnesia, but then it keeps going, 
and that's not supposed to happen. You know, a computer's supposed to work or 
not work. I wish I could remember the other two, because they're all kind of 
beautiful examples of this sort of thing. Let me see if I can remember. 
Maybe Walden might remember or Alec. 
He might. But in any event, there were a couple of others of this type where the 
problem was that we needed to basically guarantee that no matter what the other 
computer said that was nonsense, we had to keep going. 
When you say you basically rewrote the IMP, what does that entail? 
Well, what it entails is that I made a completely new IMP program and I installed 
it into all the computers in the network. 
That's a huge job. 
50 times over a period of two years. I did it every two weeks. 
You had to go each site? 
No, no. I did it all remotely here. So Tuesday morning at 6:00 every other week, 
I would bicycle into BBN and I would load up the first computer with my new 
software and then I would propagate it out. And this is a whole art form in itself, 
because if you think of the network, which in the early days, I don't know when I 
was starting it might have had 20 or 30 nodes. But by the end of a couple of years 
later, I'm sure it had 50 or 60 nodes. Here we are in Cambridge (hoof) on one side, 
you know, so this is all running old. And then you put new in here and then you put 
new in here and new in here. Well, of course, you have to have new work with new 
and you have to have old work with old, and you have to have old work with new, 
all at the same time. Because otherwise you can't then send the new program down 
here. So this is what we call backwards compatibility and forwards compatibility, 
and every time I had a change to make, if you think about a new change in a format 
of a message, you know, the message is going to start with this kind of indication 
instead of that kind. It'll start with acknowledgements instead of something else. I'd 
have to construct one or potentially more releases of software staged over a couple 
of hours or sometimes I'd do kind of one release and then I'd come back two weeks 
later and do another release. Every now and again, I'd put a release out at 6:00 and 
I'd get out to here and it would start not to work too well. We'd have to pull it 
back. So the releases had to be reversible so you could go back and put old in here 
and old in here. This whole thing was really a lot of fun. This was good stuff. To 
be able to do all this and to do it all from Cambridge and everybody else would get 
in, in the morning, and the network would be fine and they wouldn't know whether 
it was the old software or the new software. 
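An illustrative sketch, not from the interview: one common way to let old and new software coexist during a staged release is to tag each message with a format version and have the new code read and write both layouts. The version byte, the two layouts, and the names `encode`, `decode`, and `format_for` are all assumptions; the interview does not describe how the IMP formats were actually distinguished.

```python
OLD, NEW = 1, 2   # hypothetical format versions


def encode(version, acks, payload):
    if version == OLD:
        # Old layout: version byte, payload length, payload, then acks.
        return bytes([OLD, len(payload)]) + payload + bytes(acks)
    # New layout: version byte, ack count, acks, then payload.
    return bytes([NEW, len(acks)]) + bytes(acks) + payload


def decode(message):
    version, count = message[0], message[1]
    body = message[2:]
    if version == OLD:
        return list(body[count:]), bytes(body[:count])
    if version == NEW:
        return list(body[:count]), bytes(body[count:])
    raise ValueError(f"unknown format version {version}")


def format_for(peer_version, my_version=NEW):
    # Speak the newest format both ends understand, so new talks to old.
    return min(peer_version, my_version)


for peer in (OLD, NEW):
    wire = encode(format_for(peer), acks=[3, 7], payload=b"data")
    print(decode(wire))   # ([3, 7], b'data') both times
```

Because the new code still emits the old layout on request, a release built this way is also reversible: a node can be put back on the old software without stranding its neighbors.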
So when I say that I was writing it, what I mean is that Will and Dave and the 
others had written the whole thing in 1969, and it was up in 1970 and '71 when I got 
there, and the problems with it were clear. There were problems with the way the 
acknowledgements worked. There were problems with the way reassembly 
happened. There were problems with routing. And after I served a short 
apprenticeship rebuilding their Network Control Center -- or building their Network 
Control Center -- I built the first Network Control Center. 
Oh, you did? 
Yeah. So that was actually a key thing. So we had this Network Control Center 
over here. So I built that and then they sort of gave me the job of, "Okay, well, why 
don't you start taking on the assignment of building all these new protocols?" And so 
Crowther and I think some other people to a degree were involved in designing some 
of the new ideas and the